Abstract

We present a framework for learning in hidden Markov models with distributed state
representations. Within this framework, we derive a learning algorithm based on the
Expectation-Maximization (EM) procedure for maximum likelihood estimation. Analogous
to the standard Baum-Welch update rules, the M-step of our algorithm is exact and can
be solved analytically. However, due to the combinatorial nature of the hidden state
representation, the exact E-step is intractable. A simple and tractable mean field
approximation is derived. Empirical results on a set of problems suggest that both the
mean field approximation and Gibbs sampling are viable alternatives to the
computationally expensive exact algorithm.
1 Introduction

A problem of fundamental interest to machine learning is time series modeling. Due to the
simplicity and efficiency of its parameter estimation algorithm, the hidden Markov model (HMM) has
emerged as one of the basic statistical tools for modeling discrete time series, finding widespread
application in the areas of speech recognition (Rabiner and Juang, 1986) and computational
molecular biology (Baldi et al., 1994). An HMM is essentially a mixture model, encoding information
about the history of a time series in the value of a single multinomial variable (the hidden state).
This multinomial assumption allows an efficient parameter estimation algorithm to be derived (the
Baum-Welch algorithm). However, it also severely limits the representational capacity of HMMs.
For example, to represent 30 bits of information about the history of a time sequence, an HMM
would need 2^30 distinct states. On the other hand, an HMM with a distributed state
representation could achieve the same task with 30 binary units (Williams and Hinton, 1991). This
paper addresses the problem of deriving efficient learning algorithms for hidden Markov models with
distributed state representations.

The need for distributed state representations in HMMs can be motivated in two ways. First, such 
representations allow the state space to be decomposed into features that naturally decouple the 
dynamics of a single process generating the time series. Second, distributed state representations 
simplify the task of modeling time series generated by the interaction of multiple independent 
processes. For example, a speech signal generated by the superposition of multiple simultaneous 
speakers can be potentially modeled with such an architecture. 

Williams and Hinton (1991) first formulated the problem of learning in HMMs with distributed
state representation and proposed a solution based on deterministic Boltzmann learning. The
approach presented in this paper is similar to Williams and Hinton's in that it is also based on a
statistical mechanical formulation of hidden Markov models. However, our learning algorithm is
quite different in that it makes use of the special structure of HMMs with distributed state
representation, resulting in a more efficient learning procedure. Anticipating the results in
Section 2, this learning algorithm both obviates the need for the two-phase procedure of Boltzmann
machines and has an exact M-step. A different approach comes from Saul and Jordan (1995), who
derived a set of rules for computing the gradients required for learning in HMMs with distributed
state spaces. However, their methods can only be applied to a limited class of architectures.

2 Factorial hidden Markov models 

Hidden Markov models are a generalization of mixture models. At any time step, the probability
density over the observables defined by an HMM is a mixture of the densities defined by each state
in the underlying Markov model. Temporal dependencies are introduced by specifying that the
prior probability of the state at time t depends on the state at time t - 1 through a transition
matrix, P (Figure 1a).

Another generalization of mixture models, the cooperative vector quantizer (CVQ; Hinton and
Zemel, 1994), provides a natural formalism for distributed state representations in HMMs. Whereas
in simple mixture models each data point must be accounted for by a single mixture component,
in CVQs each data point is accounted for by the combination of contributions from many mixture
components, one from each separate vector quantizer. The total probability density modeled by a
CVQ is also a mixture model; however, this mixture density is assumed to factorize into a product
of densities, each density associated with one of the vector quantizers. Thus, the CVQ is a mixture
model with distributed representations for the mixture components.

Factorial hidden Markov models [1] combine the state transition structure of HMMs with the
distributed representations of CVQs (Figure 1b). Each of the d underlying Markov models has a
discrete state $s_i^t$ at time t and transition probability matrix $P_i$. As in the CVQ, the states
are mutually exclusive within each vector quantizer and we assume real-valued outputs. The
sequence of observable output vectors is generated from a normal distribution with mean given by
the weighted combination of the states of the underlying Markov models:

$$y^t \sim \mathcal{N}\Big(\sum_{i=1}^{d} W_i s_i^t,\; C\Big),$$

where C is a common covariance matrix. The k-valued states $s_i$ are represented as discrete column
vectors with a 1 in one position and 0 everywhere else; the mean of the observable is therefore a
combination of columns from each of the $W_i$ matrices.
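The generative process just described can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes an isotropic covariance C = sigma^2 I, stores each W_i as a list of output columns, and uses row-indexed transition tables `P[i][previous][next]`; the function name and all numeric values are hypothetical.

```python
import random

def sample_factorial_hmm(W, P, pi, sigma, T, rng):
    """Draw (states, observations) from a factorial HMM.

    W[i][j]    : output column (list of D floats) for state j of chain i
    P[i][l][j] : probability that chain i moves from state l to state j
    pi[i][j]   : initial-state probability of chain i
    sigma      : noise std. dev. (assumes C = sigma^2 * I, a simplification)
    """
    d = len(W)
    D = len(W[0][0])
    states, observations = [], []
    prev = None
    for t in range(T):
        current = []
        for i in range(d):
            # Each chain evolves independently of the others.
            probs = pi[i] if prev is None else P[i][prev[i]]
            current.append(rng.choices(range(len(probs)), weights=probs)[0])
        # The mean is the sum of one selected column per chain.
        mean = [sum(W[i][current[i]][n] for i in range(d)) for n in range(D)]
        y = [rng.gauss(mu, sigma) for mu in mean]
        states.append(current)
        observations.append(y)
        prev = current
    return states, observations

rng = random.Random(0)
W = [[[0.0, 1.0], [2.0, 0.0]],      # chain 0: two states, 2-D outputs
     [[0.5, 0.5], [-1.0, 1.0]]]     # chain 1: two states, 2-D outputs
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.5, 0.5], [0.5, 0.5]]
states, obs = sample_factorial_hmm(W, P, pi, sigma=0.1, T=5, rng=rng)
```

Each entry of `states` holds one state index per chain, which is the index form of the one-hot vectors used in the text.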
Figure 1. a) Hidden Markov model, b) Factorial hidden Markov model.



We capture the above probability model by defining the energy of a sequence of T states and
observations, $\{(s^t, y^t)\}_{t=1}^{T}$, which we abbreviate to {s,y}, as:

$$\mathcal{H}(\{s,y\}) = \frac{1}{2}\sum_{t=1}^{T}\Big[y^t - \sum_{i=1}^{d} W_i s_i^t\Big]' C^{-1} \Big[y^t - \sum_{i=1}^{d} W_i s_i^t\Big] - \sum_{t=1}^{T}\sum_{i=1}^{d} s_i^{t\prime} A_i s_i^{t-1}, \qquad (1)$$

where $[A_i]_{jl} = \log P(s_{ij}^t \mid s_{il}^{t-1})$ such that $\sum_{j=1}^{k} e^{[A_i]_{jl}} = 1$, and ' denotes
matrix transpose. Priors for the initial state, $s^1$, are introduced by setting the second term in
(1) to $-\sum_{i=1}^{d} s_i^{1\prime} \log \pi_i$. The probability model is defined from this energy by the
Boltzmann distribution

$$P(\{s,y\}) = \frac{1}{Z}\exp\{-\mathcal{H}(\{s,y\})\}. \qquad (2)$$



[1] We refer to HMMs with distributed state as factorial HMMs, as the features of the distributed
state factorize the total state representation.



Note that, as in the CVQ (Ghahramani, 1995), the unclamped partition function

$$Z = \int d\{y\} \sum_{\{s\}} \exp\{-\mathcal{H}(\{s,y\})\}$$

evaluates to a constant, independent of the parameters. This can be shown by first integrating the
Gaussian variables, removing all dependency on {y}, and then summing over the states using the
constraint on $e^{[A_i]_{jl}}$.
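The second step of that argument, summing over the states, can be checked numerically for a single small chain. The sketch below covers only the state sum (the Gaussian integral is not reproduced), and the 3-state transition table and prior are hypothetical values chosen so that each column of exp(A) sums to one.

```python
import itertools
import math

def log_state_weight(seq, A, log_pi):
    """Negative of the state part of the energy for one chain:
    log pi[s^1] + sum over t of A[s^t][s^{t-1}]."""
    total = log_pi[seq[0]]
    for t in range(1, len(seq)):
        total += A[seq[t]][seq[t - 1]]
    return total

# P[j][l] = P(s^t = j | s^{t-1} = l); every column sums to one.
P = [[0.7, 0.2, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]
A = [[math.log(p) for p in row] for row in P]
log_pi = [math.log(p) for p in (0.5, 0.3, 0.2)]

T, k = 4, 3
# Enumerate all k^T state sequences and add up their Boltzmann weights.
state_sum = sum(math.exp(log_state_weight(seq, A, log_pi))
                for seq in itertools.product(range(k), repeat=T))
```

Because the chains do not interact in the second term of (1), the sum over all d chains is a product of such single-chain sums.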

The EM algorithm for Factorial HMMs 

As in HMMs, the parameters of a factorial HMM can be estimated via the EM (Baum-Welch)
algorithm. This procedure iterates between assuming the current parameters to compute
probabilities over the hidden states (E-step), and using these probabilities to maximize the
expected log likelihood of the parameters (M-step).

Using the likelihood (2), the expected log likelihood of the parameters is

$$Q(\phi^{new} \mid \phi) = \langle -\mathcal{H}(\{s,y\}) - \log Z \rangle_c, \qquad (3)$$

where $\phi = \{W_i, P_i, C\}_{i=1}^{d}$ denotes the current parameters, and $\langle\cdot\rangle_c$ denotes
expectation given the clamped observation sequence and $\phi$. Given the observation sequence, the
only random variables are the hidden states. Expanding equation (3) and limiting the expectation to
these random variables, we find that the statistics that need to be computed for the E-step are
$\langle s_i^t \rangle_c$, $\langle s_i^t s_j^{t\prime} \rangle_c$, and $\langle s_i^t s_i^{t-1\prime} \rangle_c$.
Note that in standard HMM notation (Rabiner and Juang, 1986), $\langle s_i^t \rangle_c$ corresponds to
$\gamma_t$ and $\langle s_i^t s_i^{t-1\prime} \rangle_c$ corresponds to $\xi_t$, whereas
$\langle s_i^t s_j^{t\prime} \rangle_c$ has no analogue when there is only a single underlying
Markov model. The M-step uses these expectations to maximize Q with respect to the parameters.
The constant partition function allowed us to drop the second term in (3). Therefore, unlike
the Boltzmann machine, the expected log likelihood does not depend on statistics collected in an
unclamped phase of learning, resulting in much faster learning than the traditional Boltzmann
machine (Neal, 1992).
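To make the origin of these three statistics explicit, the clamped expectation of the negative energy (1) can be expanded term by term. The following is a sketch derived here from (1), not an equation quoted from the original; the trace notation is introduced only for compactness, and the prior term at t = 1 is left implicit.

```latex
\begin{aligned}
\langle -\mathcal{H} \rangle_c
  = &-\tfrac{1}{2}\sum_{t}\Big[\, y^{t\prime} C^{-1} y^{t}
     - 2\sum_{i} y^{t\prime} C^{-1} W_i \langle s_i^t \rangle_c
     + \sum_{i,j} \mathrm{tr}\big( W_i' C^{-1} W_j
       \langle s_j^t s_i^{t\prime} \rangle_c \big) \Big] \\
    &+ \sum_{t,i} \mathrm{tr}\big( A_i
       \langle s_i^{t-1} s_i^{t\prime} \rangle_c \big)
\end{aligned}
```

The first-order statistic enters through the cross term with the observation, the within-time-step pair statistic through the quadratic term, and the across-time pair statistic through the transition term.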

M-step 

Setting the derivatives of Q with respect to the output weights to zero, we obtain a linear system
of equations for W:

$$W^{new} = \Big[\sum_{N,t} y \langle s' \rangle_c\Big] \Big[\sum_{N,t} \langle s s' \rangle_c\Big]^{\dagger},$$

where s and W are the vector and matrix of concatenated $s_i$ and $W_i$, respectively, $\sum_N$ denotes
summation over a data set of N sequences, and $\dagger$ is the Moore-Penrose pseudo-inverse. To estimate
the log transition probabilities we solve $\partial Q / \partial [A_i]_{jl} = 0$ subject to the constraint
$\sum_j e^{[A_i]_{jl}} = 1$, obtaining

$$[A_i]_{jl}^{new} = \log\left(\frac{\sum_{N,t} \langle s_{ij}^t s_{il}^{t-1} \rangle_c}{\sum_{N,t,j} \langle s_{ij}^t s_{il}^{t-1} \rangle_c}\right). \qquad (4)$$

The covariance matrix can be similarly estimated:

$$C^{new} = \sum_{N,t} y y' - \sum_{N,t} y \langle s \rangle_c' \langle s s' \rangle_c^{\dagger} \langle s \rangle_c y'.$$

The M-step equations can therefore be solved analytically; furthermore, for a single underlying
Markov chain, they reduce to the traditional Baum-Welch re-estimation equations.
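As an illustration of the transition update (4), the sketch below normalizes a table of expected pair counts for one chain and takes logarithms. The function name and the count values are hypothetical, and the counts are assumed strictly positive so that the logarithm is defined.

```python
import math

def update_log_transitions(pair_counts):
    """Equation (4) for one chain.

    pair_counts[j][l] holds the expected statistic accumulated over
    sequences and time steps for the pair (state j at t, state l at t-1).
    Returns A with A[j][l] = log of the count normalized over j.
    """
    k = len(pair_counts)
    A = [[0.0] * k for _ in range(k)]
    for l in range(k):
        column_total = sum(pair_counts[j][l] for j in range(k))
        for j in range(k):
            A[j][l] = math.log(pair_counts[j][l] / column_total)
    return A

counts = [[8.0, 1.0],
          [2.0, 3.0]]
A_new = update_log_transitions(counts)
```

By construction each column of exp(A_new) sums to one, which is the constraint imposed on the log transition probabilities.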
E-step

Unfortunately, as in the simpler CVQ, the exact E-step for factorial HMMs is computationally
intractable. For example, the expectation of the j-th unit in vector i at time step t, given {y}, is:

$$\langle s_{ij}^t \rangle_c = P(s_{ij}^t = 1 \mid \{y\}, \phi)
  = \sum_{j_1, \ldots, j_{h \neq i}, \ldots, j_d}^{k} P(s_{1 j_1}^t = 1, \ldots, s_{ij}^t = 1, \ldots, s_{d j_d}^t = 1 \mid \{y\}, \phi).$$

Although the Markov property can be used to obtain a forward-backward-like factorization of this
expectation across time steps, the sum over all possible configurations of the other hidden units
within each time step is unavoidable. For a data set of N sequences of length T, the full E-step
calculated through the forward-backward procedure has time complexity $O(NTk^{2d})$. Although
more careful bookkeeping can reduce the complexity to $O(NTdk^{d+1})$, the exponential time cannot
be avoided. This intractability of the exact E-step is due inherently to the cooperative nature of
the model: the setting of one vector only determines the mean of the observable if all the other
vectors are fixed.
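To give a feel for these growth rates, the following sketch evaluates the quoted leading-order terms with all constants dropped, alongside the O(NTkd) cost per pass that the text gives for the approximate methods. The function name is hypothetical; the example sizes match the largest problem in Table 1 (10 training sequences of length 20, d = 5, k = 3).

```python
def e_step_costs(N, T, d, k):
    """Leading-order operation counts quoted in the text (constants dropped)."""
    return {
        "exact_naive": N * T * k ** (2 * d),         # O(N T k^(2d))
        "exact_careful": N * T * d * k ** (d + 1),   # O(N T d k^(d+1))
        "approximate_pass": N * T * k * d,           # O(N T k d)
    }

costs = e_step_costs(N=10, T=20, d=5, k=3)
for name, value in costs.items():
    print(f"{name:>18}: {value:,}")
```

These are order-of-magnitude counts only; measured run times such as those in Table 1 also depend on constant factors and on the number of iterations or samples used.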

Rather than summing over all possible hidden state patterns to compute the exact expectations,
a natural approach is to approximate them through a Monte Carlo method such as Gibbs sampling.
The procedure starts with a clamped observable sequence {y} and a random setting of the hidden
states $\{s_i^t\}$. At each time step, each state vector is updated stochastically according to its
probability distribution conditioned on the setting of all the other state vectors:
$s_i^t \sim P(s_i^t \mid \{y\}, \{s_j^\tau : j \neq i \text{ or } \tau \neq t\}, \phi)$.
These conditional distributions are straightforward to compute and a full pass
of Gibbs sampling requires O(NTkd) operations. The first- and second-order statistics needed
to estimate $\langle s_i^t \rangle_c$, $\langle s_i^t s_j^{t\prime} \rangle_c$ and $\langle s_i^t s_i^{t-1\prime} \rangle_c$
are collected using the $s_{ij}^t$'s visited and the probabilities estimated during this sampling process.
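One way to compute the conditional distribution used in each Gibbs update is sketched below. It follows from the energy (1) under the simplifying assumption C = sigma^2 I, which is not part of the original model; states are stored as integer indices, and all names and numeric values are hypothetical.

```python
import math
import random

def gibbs_conditional(i, t, states, y, W, A, log_pi, sigma):
    """P(s_i^t = j | everything else), assuming C = sigma^2 * I.

    states[t][i] : current state index of chain i at time t
    W[i][j]      : output column for state j of chain i
    A[i][j][l]   : log P(s_i^t = j | s_i^{t-1} = l)
    """
    T, d, k = len(states), len(W), len(W[i])
    # Residual after removing the contribution of all other chains.
    residual = list(y[t])
    for h in range(d):
        if h != i:
            col = W[h][states[t][h]]
            residual = [r - c for r, c in zip(residual, col)]
    logits = []
    for j in range(k):
        err = sum((r - w) ** 2 for r, w in zip(residual, W[i][j]))
        logit = -0.5 * err / sigma ** 2
        # Dependence on the previous time step (or the prior at t = 0).
        logit += log_pi[i][j] if t == 0 else A[i][j][states[t - 1][i]]
        # Dependence on the next time step, if there is one.
        if t + 1 < T:
            logit += A[i][states[t + 1][i]][j]
        logits.append(logit)
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def gibbs_sweep(states, y, W, A, log_pi, sigma, rng):
    """One full pass: resample every chain at every time step in place."""
    for t in range(len(states)):
        for i in range(len(W)):
            probs = gibbs_conditional(i, t, states, y, W, A, log_pi, sigma)
            states[t][i] = rng.choices(range(len(probs)), weights=probs)[0]
    return states

W = [[[0.0], [1.0]], [[0.0], [2.0]]]          # two binary chains, 1-D output
A = [[[math.log(0.9), math.log(0.1)],
      [math.log(0.1), math.log(0.9)]]] * 2    # same table for both chains
log_pi = [[math.log(0.5)] * 2] * 2
y = [[3.0], [3.0], [0.0]]
states = [[0, 0], [0, 0], [0, 0]]
probs = gibbs_conditional(0, 0, states, y, W, A, log_pi, sigma=0.5)
states = gibbs_sweep(states, y, W, A, log_pi, 0.5, random.Random(1))
```

The required expectations would then be estimated by averaging the sampled states, or the conditional probabilities themselves, over repeated sweeps.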

Mean field approximation 

A different approach to computing the expectations in an intractable system is given by mean field
theory. A mean field approximation for factorial HMMs can be obtained by defining the energy
function

$$\tilde{\mathcal{H}}(\{s,y\}) = \frac{1}{2}\sum_{t} [y^t - \mu^t]' C^{-1} [y^t - \mu^t] - \sum_{t,i} s_i^{t\prime} \log m_i^t,$$

which results in a completely factorized approximation to probability density (2):

$$\tilde{P}(\{s,y\}) \propto \prod_{t} \exp\Big\{-\frac{1}{2} [y^t - \mu^t]' C^{-1} [y^t - \mu^t]\Big\} \prod_{t,i,j} (m_{ij}^t)^{s_{ij}^t}. \qquad (5)$$

In this approximation, the observables are independently Gaussian distributed with mean $\mu^t$ and
each hidden state vector is multinomially distributed with mean $m_i^t$. This approximation is made as
tight as possible by choosing the mean field parameters $\mu^t$ and $m_i^t$ that minimize the
Kullback-Leibler divergence

$$KL(\tilde{P} \| P) = \langle \log \tilde{P} \rangle_{\tilde{P}} - \langle \log P \rangle_{\tilde{P}},$$

where $\langle\cdot\rangle_{\tilde{P}}$ denotes expectation over the mean field distribution (5). With the
observables clamped, $\mu^t$ can be set equal to the observable $y^t$. Minimizing $KL(\tilde{P} \| P)$ with
respect to the mean field parameters for the states results in a fixed-point equation which can be
iterated until convergence:

$$m_i^{t\,new} = \sigma\Big\{ W_i' C^{-1} [y^t - \hat{y}^t] + W_i' C^{-1} W_i m_i^t - \frac{1}{2}\,\mathrm{diag}\{W_i' C^{-1} W_i\} + A_i m_i^{t-1} + A_i' m_i^{t+1} \Big\}, \qquad (6)$$

where $\hat{y}^t = \sum_i W_i m_i^t$ and $\sigma\{\cdot\}$ is the softmax exponential, normalized over each
hidden state vector. The first term is the projection of the error in the observable onto the
weights of state vector i: the more a hidden unit can reduce this error, the larger its mean field
parameter. The terms involving $W_i' C^{-1} W_i$ arise from the fact that
$\langle (s_{ij}^t)^2 \rangle_{\tilde{P}}$ is equal to $m_{ij}^t$ and not $(m_{ij}^t)^2$. The last two terms
introduce dependencies forward and backward in time. Each state vector is asynchronously updated
using (6), at a time cost of O(NTkd) per iteration. Convergence is diagnosed by monitoring the KL
divergence in the mean field distribution between successive time steps; in practice convergence is
very rapid (about 2 to 10 iterations of (6)).
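The fixed-point update (6) can be sketched as follows. As a simplification that is not part of the original model, the code assumes C = sigma^2 I, and at the first time step it uses the log prior in place of the backward term, following the prior convention described for the energy (1). All names and numeric values are hypothetical.

```python
import math

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mean_field_update(i, t, m, y, W, A, log_pi, sigma):
    """New m_i^t from equation (6), assuming C = sigma^2 * I.

    m[t][i][j] : current mean field parameter
    W[i][j]    : output column for state j of chain i
    A[i][j][l] : log P(s_i^t = j | s_i^{t-1} = l)
    """
    T, d, k, D = len(m), len(W), len(W[i]), len(y[t])
    # y^t minus the mean field prediction of the other chains. This equals
    # [y^t - yhat^t] + W_i m_i^t, so it covers the first two terms of (6).
    residual = list(y[t])
    for h in range(d):
        if h != i:
            for j in range(len(W[h])):
                for n in range(D):
                    residual[n] -= W[h][j][n] * m[t][h][j]
    args = []
    for j in range(k):
        w = W[i][j]
        proj = sum(wn * rn for wn, rn in zip(w, residual)) / sigma ** 2
        half_diag = 0.5 * sum(wn * wn for wn in w) / sigma ** 2
        if t == 0:
            back = log_pi[i][j]
        else:
            back = sum(A[i][j][l] * m[t - 1][i][l] for l in range(k))
        fwd = 0.0
        if t + 1 < T:
            fwd = sum(A[i][l][j] * m[t + 1][i][l] for l in range(k))
        args.append(proj - half_diag + back + fwd)
    return softmax(args)

W = [[[0.0], [1.0]], [[0.0], [2.0]]]          # two binary chains, 1-D output
A = [[[math.log(0.9), math.log(0.1)],
      [math.log(0.1), math.log(0.9)]]] * 2
log_pi = [[math.log(0.5)] * 2] * 2
y = [[3.0], [3.0]]
m = [[[0.5, 0.5], [0.5, 0.5]] for _ in y]
for _ in range(10):                           # a few asynchronous sweeps
    for t in range(len(y)):
        for i in range(len(W)):
            m[t][i] = mean_field_update(i, t, m, y, W, A, log_pi, sigma=0.5)
```

In this toy setting the observation 3.0 is explained by both chains being in their second state, so both mean field vectors move toward that state.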

3 Empirical Results 

We compared three EM algorithms for learning in factorial HMMs, using Gibbs sampling, the mean
field approximation, and the exact (exponential) E-step respectively, on the basis of performance
and speed on randomly generated problems. Problems were generated from a factorial HMM structure,
the parameters of which were sampled from a uniform [0, 1] distribution, and appropriately
normalized to satisfy the sum-to-one constraints of the transition matrices and priors. Also
included in the comparison was a traditional HMM with as many states ($k^d$) as the factorial HMM.

Table 1 summarizes the results. Even for moderately large state spaces (d ≥ 3 and k ≥ 3)
the standard HMM with $k^d$ states suffers from severe overfitting. Furthermore, both the standard
HMM and the exact E-step factorial HMM are extremely slow on the larger problems. The Gibbs
sampling and mean field approximations offer roughly comparable performance at a great increase
in speed.

4 Discussion 

The basic contribution of this paper is a learning algorithm for hidden Markov models with
distributed state representations. The standard Baum-Welch procedure is intractable for such
architectures as the size of the state space generated from the cross product of d k-valued
features is $O(k^d)$, and the time complexity of Baum-Welch is quadratic in this size. More
importantly, unless special constraints are applied to this cross-product HMM architecture, the
number of parameters also grows as $O(k^{2d})$, which can result in severe overfitting.
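The gap in parameter counts can be made concrete with a short calculation. The sketch below counts only transition-matrix entries (output parameters are ignored), uses the problem sizes from Table 1, and the function name is hypothetical.

```python
def transition_parameter_counts(d, k):
    """Entries in the transition matrices of the two architectures."""
    flat_states = k ** d                 # cross-product HMM state space
    return {
        "flat_states": flat_states,
        "flat_transition_entries": flat_states ** 2,   # grows as k^(2d)
        "factorial_transition_entries": d * k * k,     # d matrices, k x k each
    }

for d, k in [(3, 2), (3, 3), (5, 2), (5, 3)]:
    print(d, k, transition_parameter_counts(d, k))
```

For d = 5 and k = 3, the cross-product HMM has 243 states and 59,049 transition entries, against 45 for the factorial architecture.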

The architecture for factorial HMMs presented in this paper did not include any coupling between 
the underlying Markov chains. It is possible to extend the algorithm presented to architectures which 
incorporate such couplings. However, these couplings must be introduced with caution as they may 
result either in an exponential growth in parameters or in a loss of the constant partition function 
property. 

The learning algorithm derived in this paper assumed real-valued observables. The algorithm can
also be derived for HMMs with discrete observables, an architecture closely related to sigmoid belief
networks (Neal, 1992). However, the nonlinearities induced by discrete observables make both the
E-step and M-step of the algorithm more difficult.



Table 1: Comparison of factorial HMM on four problems of varying size

d  k  Alg    #  Train               Test                 Cycles          Time/Cycle
3  2  HMM    5  649 ± 8             358 ± 81             33 ± 19         1.1 s
      Exact     877 ±               768 ±                22 ± 6          3.0 s
      Gibbs     710 ± 152           627 ± 129            28 ± 11         6.0 s
      MF        755 ± 168           670 ± 137            32 ± 22         1.2 s
3  3  HMM    5  670 ± 26            -782 ± 128           23 ± 10         3.6 s
      Exact     568 ± 164           276 ± 62             35 ± 12         5.2 s
      Gibbs     564 ± 160           305 ± 51             45 ± 16         9.2 s
      MF        495 ± 83            326 ± 62             38 ± 22         1.6 s
5  2  HMM    5  588 ± 37            -2634 ± 566          18 ± 1          5.2 s
      Exact     223 ± 76            159 ± 80             31 ± 17         6.9 s
      Gibbs     123 ± 103           73 ± 95              40 ± 5          12.7 s
      MF        292 ± 101           237 ± 103            54 ± 29         2.2 s
5  3  HMM    3  1671, 1678, 1690    -∞, -∞, -∞           14, 14, 12      90.0 s
      Exact     -55, -354, -295     -123, -378, -402     90, 100, 100    51.0 s
      Gibbs     -123, -160, -194    -202, -?37, -307     100, 73, 100    14.2 s
      MF        -287, -286, -296    -364, -370, -365     100, 100, 100   4.7 s

Table 1. Data was generated from a factorial HMM with d underlying Markov
models of k states each. The training set was 10 sequences of length 20 where the
observable was a 4-dimensional vector; the test set was 20 such sequences. HMM
indicates a hidden Markov model with $k^d$ states; the other algorithms are factorial
HMMs with d underlying k-state models. Gibbs sampling used 10 samples of each
state. The algorithms were run until convergence, as monitored by relative change
in the likelihood, or a maximum of 100 cycles. The # column indicates number of
runs. The Train and Test columns show the log likelihood ± one standard deviation
on the two data sets. The last column indicates approximate time per cycle on a
Silicon Graphics R4400 processor running Matlab.



In conclusion, we have presented Gibbs sampling and mean field learning algorithms for factorial
hidden Markov models. Such models incorporate the time series modeling capabilities of hidden
Markov models and the advantages of distributed representations for the state space. Future work
will concentrate on a more efficient mean field approximation in which the forward-backward
algorithm is used to compute the E-step exactly within each Markov chain, and mean field theory is
used to handle interactions between chains (Saul and Jordan, 1996).